ABSTRACT

This paper explores the comparability of item 
calibrations for three types of items: (1) text only; (2) text with 
photographs; and (3) text plus graphics when items are presented on 
written tests and computerized adaptive tests. Data are from five 
different medical technology certification examinations administered 
nationwide in 1993. The Rasch model was used to calibrate items for 
the two test formats. Item calibrations obtained from each 
administrative mode were then compared. No significant differences 
were found between text only item calibrations obtained from the 
written tests and the computerized adaptive test. While some items 
with photographic or figure accompaniment showed slightly different 
item calibrations between the administrative modes, nonstatistical 
explanations explain most of the minor differences discovered. The 
results of this investigation confirm that Rasch item calibrations 
from written tests are appropriate for use on computerized adaptive 
tests. Included are four tables and six figures. (Contains 7
references.) (Author/SLD)

Item Calibration Considerations: A Comparison of Item
Calibrations on Written and Computerized Adaptive Examinations



Abstract 

This paper explores the comparability of item calibrations for
three types of items: 1) text only, 2) text with photographs, and 3)
text plus graphics, when items are presented on written tests and
computerized adaptive tests. Data are from five different medical 
technology certification examinations administered nationwide in 
1993. The Rasch model was used to calibrate items for the two test 
formats. Item calibrations obtained from each administrative mode 
were then compared. No significant differences were found between 
text only MCQ item calibrations obtained from the written tests and 
the computerized adaptive test. While some items with photographic 
or figure accompaniment showed slightly different item calibrations 
between the administrative modes, non-statistical factors
explain most of the minor differences discovered. The results of
this investigation confirm that Rasch item calibrations from
written tests are appropriate for use on computerized adaptive
tests.






Green (1988) suggests that item equivalency between CAT and 
written administrations is really a problem of scaling rather than 
equating. Even when there is a shift in item calibrations due to 
mode of administration, the scales of the tests are equivalent in 
construct. If results are reproducible in either mode, we can
assert that the modes of administration produce comparable 
results. 

The equivalency of item calibrations may also be addressed in 
terms of stability. In their work on item calibration stability
across modes of administration, Bergstrom and Lunz (in press)
recalibrated items using data from a computerized adaptive test and 
compared those recalibrations to the calibrations obtained for the 
same items from traditional written test administration. Ninety- 
eight percent of the item calibrations remained stable across 
administration modes when the shift in scale (standard deviation) 
between CAT and written was accounted for. 

One of the specifications of the Rasch model (Rasch, 1960;
Wright & Stone, 1979) is that measurement is independent of the
specific set of items used to measure ability and thus the same 
conclusions should be made about a person's ability regardless of 
the subset of items taken. This is extremely important in CAT 
because examinees take different subsets of items, all of which are 
calibrated to the same scale. Thus the stability of 
calibrations is critical for consistent interpretation of examinee
performance.
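
To make this property concrete, the following minimal sketch evaluates the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item difficulty (both in logits). The function name and the example values are illustrative assumptions and are not taken from the study.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values, chosen only for illustration.
p_matched = rasch_probability(ability=1.0, difficulty=1.0)  # item matched to person
p_easy = rasch_probability(ability=1.0, difficulty=-1.0)    # item 2 logits easier
p_hard = rasch_probability(ability=1.0, difficulty=3.0)     # item 2 logits harder
```

Because every item is expressed on the same logit scale, an examinee's expected performance can be compared across different item subsets, which is the situation that arises in CAT.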

While text-only MCQ item calibrations obtained from 
computerized adaptive and written tests have been shown to be 
highly stable (Bergstrom & Lunz, in press), little has been done to
investigate the performance comparability of MCQ items with 
accompanying graphical representations presented on the computer 
screen along with the item. Do these MCQ items with screen 
graphics function comparably in computerized and written formats? 

This paper explores the comparability of item calibrations for
three types of items presented on screen and in written format:
1) text only, 2) text plus photographs, and 3) text plus
non-photographic figures or charts. The compared item
calibrations are from two different administrative modes, written
and computerized adaptive. Three research questions are addressed.
First, are item calibrations for text-only MCQ items (which include
no visual or graphical presentations) comparable when calculated from
written and computerized adaptive response data? Second, are item
calibrations for MCQ items with accompanying photographs
comparable on written tests (with photographs printed in a book)
and on computerized adaptive tests (with photographs shown on the
screen)? Third, are MCQ item calibrations with accompanying
graphics (charts, tables, non-photographic figures) from written
test data (with the figure printed in a book) comparable to item
calibrations from computerized adaptive test data (when the figure
is shown on the screen)?

Methods

Data 

Data are from three different certification examinations that 
were administered in 1993. Each examination was administered in 
both the written and computerized adaptive format. Item
calibrations for the written tests were derived from large groups 
of students who took each item in the written format (Test A = 
1052; Test B = 731; Test C = 641). Because of the nature of 
computer adaptive testing (examinees may not all see the same 
items) smaller numbers of examinees saw each item in the CAT 
format. Item calibrations from the CAT exam were derived from
varying numbers of examinee responses per item (a minimum of 17 to
20, depending on the item set, and a maximum of 35).
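
The difference in sample sizes matters because the precision of a Rasch item calibration depends on how many responses inform it. As a rough sketch, the asymptotic standard error of an item difficulty is the reciprocal square root of the summed response variances p(1 - p). The function name, and the assumption that every examinee has a 50% chance of success (the most informative case), are illustrative and do not reproduce the study's data.

```python
import math


def item_standard_error(probabilities):
    """Approximate SE (in logits) of a Rasch item calibration.

    `probabilities` holds the model probability of success for each
    person who answered the item.
    """
    information = sum(p * (1.0 - p) for p in probabilities)
    return 1.0 / math.sqrt(information)


# Hypothetical, perfectly targeted responses (p = 0.5 for everyone).
se_written = item_standard_error([0.5] * 1052)  # large written sample
se_cat = item_standard_error([0.5] * 25)        # small CAT sample
```

Under these assumptions the CAT standard error is several times larger than the written one, so the CAT calibrations dominate the uncertainty of any comparison between the two modes.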

The exams were high stakes, so examinees were highly motivated 
to be successful. Three different examinations were analyzed to
achieve more generalizable results and avoid the possibility of 
identifying test specific patterns. This also increased the number 
of items which presented visual material. The total number of 
items compared was 54 (20 text-only MCQ, 34 MCQ with accompanying
figures or photographs).

Design

The Rasch model was used to calibrate items for the two test 
formats, written and computerized adaptive, after the testing was 
completed. Mean and standard deviation differences were accounted 
for by a linear transformation that placed the written item 
calibrations on the same scale as the calibrations from the CAT:

$$y = a + bx \tag{1}$$

where a = intercept, b = slope, x = original measure and y =
transformed measure.

This corrected for any "differences of scale" that may have existed 
(see Tables 1, 2 & 3 for exact formulae). 
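
A minimal sketch of this linear transformation follows. It assumes, as the table notes suggest, that the slope and intercept are chosen so that the transformed written calibrations share the mean and standard deviation of the CAT calibrations. The function names and sample values are hypothetical; the coefficients actually used in the study are listed in the tables.

```python
from statistics import mean, pstdev


def fit_linear_transformation(written, cat):
    """Return (a, b) such that a + b*x has the mean and SD of `cat`."""
    b = pstdev(cat) / pstdev(written)      # slope: ratio of standard deviations
    a = mean(cat) - b * mean(written)      # intercept: aligns the means
    return a, b


def transform(x, a, b):
    """Apply y = a + b*x to a single written calibration."""
    return a + b * x


# Hypothetical calibrations in logits, for illustration only.
written = [-2.0, -1.0, 0.0, 1.0, 2.0]
cat = [-1.4, -0.8, 0.1, 0.9, 1.7]

a, b = fit_linear_transformation(written, cat)
rescaled = [transform(x, a, b) for x in written]
```

After rescaling, any remaining difference between a written and a CAT calibration reflects the individual item rather than a difference in scale origin or spread.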

Item calibrations obtained from each administrative mode were 
compared using standardized differences calculated as: 

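$$z = \frac{d_{CAT} - d_{W}}{\sqrt{s_{CAT}^{2} + s_{W}^{2}}} \tag{2}$$

where d = item calibration (difficulty) and s = standard error, with
the subscripts CAT and W denoting the computerized adaptive and
written administrations.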
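
The standardized difference can be sketched as follows, assuming the common Rasch form in which the difference between two calibrations is divided by the square root of the sum of their squared standard errors. The function names, the 1.96 critical value used for a two-sided 95% criterion, and the example values are assumptions made for illustration.

```python
import math


def standardized_difference(d_cat, se_cat, d_written, se_written):
    """z for the difference between a CAT and a (rescaled) written calibration."""
    return (d_cat - d_written) / math.sqrt(se_cat ** 2 + se_written ** 2)


def differs_significantly(z, critical=1.96):
    """True when |z| exceeds the two-sided critical value."""
    return abs(z) > critical


# Hypothetical items: calibrations and standard errors in logits.
z_small = standardized_difference(0.50, 0.30, 0.40, 0.04)
z_large = standardized_difference(1.50, 0.30, 0.40, 0.04)
```

With standard errors of this size, the CAT term dominates the denominator, so a fairly large shift in calibration is needed before an item is flagged.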

Written and CAT calibrations for each group of items were then 
plotted using 95% quality control lines to identify the items with 
significantly different item calibrations. Where differences were 
found, the items were examined. 

In addition, for purposes of qualitative discussion, text 
items with non-photographic figures or charts were categorized into 
three levels of complexity (see Figure 1 for specific examples).



Insert Figure 1 about here 



Simple figures were those X-Y graphs without number markings and 
figures with little or no labelling. Average figures were typical 
X-Y graphs and figures with small or complex labelling. Complex 
figures included large tables of numbers, complex graphs and very 
fine print. Our hypothesis was that if calibration differences
appeared, they would most likely involve items with complex
figures.

Results 

Most 'text only' item calibrations obtained from the written 
and computerized adaptive administrative formats were comparable, 
and fell within the 95% confidence band. Two sets of 10 text-only
MCQ items were drawn for the comparisons. Item calibrations for
each group are presented in Tables 1 and 2 and are compared in 
Figures 2 and 3 respectively. 



Insert Tables 1 & 2 and Figures 2 & 3 about here 



When scales were adjusted for variance, item calibrations for most 
items on both modes fell within the standard error of measurement, 
as expected. 



Insert Table 3 and Figure 4 about here 



Table 3 presents calibrations for 13 MCQ items with 
accompanying figures or photographs drawn from three different 
tests (a,b,c). Figure 4 shows that of these items with photographs 
or figures, only 4 showed different item calibrations between the 
written and computerized adaptive modes. The content and format of 
these items were reviewed. 

Items A and B appeared more difficult on the written test than 
in the on-screen computerized adaptive mode. Item A was accompanied 
by a photograph which was larger on the screen than in print. The
enlargement made the image clearer, thus making the item easier. It 
could be argued that because of the qualitative difference in the 
clarity of the photograph presented, Item A should instead be 
considered as a different item in each of the administrative modes. 
Item B was accompanied by a photograph that was of relatively poor 
quality in both written and computer presentations. However, when 
content experts evaluated the item they concluded that it could be 
answered effectively without the aid of the graph. 

Items C and D appeared slightly more difficult when presented 
on the screen in the computerized adaptive mode than on the written 
test. For item C, no identifiable reason for a difference could be 
found. Item D was accompanied by an X-Y plot of average complexity. 
Perhaps figures, tables, and other representations that include 
numbers are looked at on-screen in a different way than in print. 
We hypothesized that reading numeric tables and charts on screen 
may be more difficult than in print.

To test this hypothesis, additional items with charts, tables 
and other non-photographic accompaniments were selected from across 
three tests (see Figure 5). Table 4 presents calibrations for MCQ
items with accompanying charts, tables or non-photographic
material, across these three tests (a,b,c).



Insert Table 4 and Figure 5 about here 



The majority of the MCQ items accompanied by non-photographic
figures or charts had comparable written and CAT item
calibrations. In general, the items appeared equally difficult
in the written and computerized adaptive mode. The level of each 
figure (complex, average and simple) is noted in Figure 5. It is 
clear that the simple items (labeled S) have comparable 
calibrations between the two administrative modes. Average items 
(labeled M) and complex items are generally equivalent. Only one
average and one high-complexity item had significantly different
calibrations across modes of administration.

Two items were identified as calibrating differently in the 
CAT and written modes. These items (see Figure 5) are located to
the right of the 95% confidence band, suggesting that they were 
more difficult on CAT than on the written test. One item (labelled 
M) is described as average and the other (labelled H) as complex. 
It was determined that the average item (M) was answerable without 
reference to the figure. The figure for the complex item (H) used a
much smaller type size in its labelling, possibly making the item
more difficult to read on the screen. As noted for an earlier item,
this qualitatively changes the item and may account for the
difference in calibration.



Insert Figure 6 about here 



Analysis of the standardized differences (Z-scores) between
the written and CAT item calibrations indicates a very strong
trend of equivalence. Figure 6 plots item calibrations against the
standardized differences. A Z-score beyond ±2.00 indicates roughly
95% confidence that the item calibrations are different. However,
all Z-scores were within ±2.00 and most standardized residuals were
within ±1.5. The mean difference of .07 falls well within one
standard deviation of the Z-distribution's mean of zero and thus
supports this finding of equivalence.
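
A summary of this kind can be sketched by computing the mean standardized difference and the share of items that fall within a chosen threshold. The function name and the z values below are invented for illustration only and are not the study's data.

```python
from statistics import mean


def summarize_standardized_differences(z_scores, threshold=2.0):
    """Return the mean z and the proportion of items with |z| <= threshold."""
    within = sum(1 for z in z_scores if abs(z) <= threshold)
    return {"mean": mean(z_scores), "share_within": within / len(z_scores)}


# Hypothetical standardized differences for six items.
example = summarize_standardized_differences([-1.4, -0.6, 0.1, 0.3, 0.9, 1.1])
```

A mean close to zero together with a share near 1.0 is the pattern one would expect if the two modes yield equivalent calibrations.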

Conclusions

With the increase in popularity of computerized adaptive 
testing, many organizations are converting written examinations to 
the CAT format. In this conversion, it is important to be confident 
that item calibrations are comparable when calculated using data 
from written and CAT administrations. It is equally important to 
acknowledge that some items may not function in precisely the same 
way in written and CAT format because of a change in the examinee's 
perception or because of some qualitative factor related to the 
comparability of the item across formats. Our initial
investigation found only a few substantial differences in item
calibrations across the two administration modes.

Text only MCQ items were found to be equivalently calibrated 
on written and CAT formats. Similarly, most of the MCQ items with 
photographic or figure accompaniment in our sample were comparably 
calibrated. In those few instances where differences were found, 
the differences could generally be explained by content or other 
non-statistical, qualitative factors. When images are taken from
a printed form and placed onto the screen in a CAT format, careful
attention is required to ensure that the images are indeed
interpretable. Differences in picture clarity, image size, and text
complexity may all contribute to the image being qualitatively
different. Such substantive issues are extremely important in
moving from a written to a CAT format. Where printed photographs
and figures are identical to the on-screen images, the item
calibrations seem as stable across administrative modes as those
of text-only MCQ items.

While a number of complex figures were used in the 
comparisons, further investigation should include the analysis of 
additional items with complex non-photographic chart or figure 
accompaniments.

Figure 1: Examples of Figure Complexity Levels

[Figure not reproduced. Three example panels: SIMPLE (a sparsely
labelled figure marked ALBUMIN), AVERAGE (an X-Y graph with a numbered
pO2 (mm Hg) axis), and COMPLEX (a densely labelled chart with values
in mg/dL).]

Table 1: Item Calibrations (Text Only MCQ Items)
Test 1 (shown in Figure 2)

        ---- CAT ----    -- Written --    CAT-Written
Item    Calib    SE      Calib    SE       Residual
  1      2.57   0.31      2.79   0.03       -0.22
  2      2.07   0.23      2.46   0.03       -0.39
  3      1.02   0.25      0.35   0.04        0.67
  4      0.89   0.27      0.85   0.05        0.04
  5      0.45   0.26      0.33   0.03        0.12
  6     -0.17   0.22     -0.20   0.04        0.03
  7     -0.75   0.43     -1.27   0.06        0.52
  8     -0.85   0.27     -0.96   0.05        0.11
  9     -1.76   0.31     -1.52   0.05       -0.24
 10     -2.19   0.45     -1.55   0.04       -0.64

X = 0.13   SD = 1.56 *

N Items = 10
Range of persons used to calibrate CAT items = 20-33
N of persons used to calibrate written items = 1052
Linear Transformation of Written Scores to CAT scale: y = .01 + .763x

Table 2: Item Calibrations (Text Only MCQ Items)
Test 2 (shown in Figure 3)

        ---- CAT ----    -- Written --    CAT-Written
Item    Calib    SE      Calib    SE       Residual
  1      1.59   0.21      1.45   0.03        0.14
  2      1.06   0.22      1.00   0.06        0.06
  3      1.06   0.29      1.21   0.04       -0.15
  4      1.02   0.28      1.01   0.03        0.01
  5      0.76   0.32      0.43   0.03        0.33
  6      0.51   0.42      1.06   0.04       -0.55
  7      0.51   0.33      0.20   0.03        0.31
  8     -0.06   0.23     -0.01   0.03       -0.05
  9     -0.80   0.28     -0.56   0.04       -0.24
 10     -0.84   0.31     -1.00   0.03        0.16

X = 0.48   SD = 0.81 *

N Items = 10
Range of persons used to calibrate CAT items = 20-35
N of persons used to calibrate written items = 731
Linear Transformation of Written Scores to CAT scale: y = .25 + .564x

* Since written measures were transformed onto the CAT scale, means
and standard deviation units for each group are identical.

Figure 2: Item Calibration Comparison, Written versus Computer Adaptive Mode

[Plot not reproduced. Axes: Calibration on Written versus Calibration
on CAT; N = 10 items.]

Table 3: Item Calibrations (MCQ w/ Figures or Photographs)
(shown in Figure 4)

         CAT      -- Written --    CAT-Written
Item     Calib    Calib    SE       Residual
  1a      0.74     0.39   0.05        0.35
  2a      0.64     0.23   0.04        0.41
  3a     -0.38     0.77   0.06       -1.15
  4a     -0.38    -0.15   0.06       -0.23
  5b      0.10    -0.91   0.06        1.01
  6b     -0.65    -0.28   0.07       -0.37
  7c      0.07    -0.01   0.05        0.08
  8c      1.46     0.39   0.07        1.07 †
  9c      1.02     1.32   0.04       -0.30
 10c      1.17     0.99   0.05        0.18
 11c      0.88     1.12   0.05       -0.24
 12c      1.08     0.91   0.06        0.17
 13c      0.37     1.28   0.07       -0.91 †

[CAT SE values are not legible in the source.]

N Items:
Test a = 4
Test b = 2
Test c = 7

Range of persons used to calibrate CAT items = 17

N of persons used to calibrate written items:
Test a = 1052
Test b = 731
Test c = 641

Linear Transformation of Written Scores to CAT scale:
Test a: y = .01 + .763x   [X = 0.13  SD = 1.56 *]
Test b: y = .25 + .564x   [X = 0.48  SD = 0.81 *]
Test c: y = .02 + .827x   [X = 1.05  SD = 1.17 *]

† See four items outside the 95% confidence band (Figure 4)

* Since written measures were transformed onto the CAT scale, means
and standard deviation units for each group are identical.

Figure 4: Item Calibration Comparison, Printed Graphics vs On-Screen Graphics

[Plot not reproduced. One axis is labelled Calibration In Print.]

Table 4: Item Calibrations (MCQ w/ Figures)
(shown in Figure 5)

        ---- CAT ----    -- Written --    CAT-Written
Item    Calib    SE      Calib    SE       Residual
  1a    -0.16   0.44      0.19   0.06       -0.35
  2a    -0.12   0.46     -0.57   0.05        0.45
  3a     0.27   0.78     -1.44   0.10        1.71
  4a    -0.36   0.31     -0.12   0.07       -0.24
  5a     0.07   0.21      0.13   0.04       -0.06
  6a     2.07   0.75      1.12   0.05       -0.95
  7a     0.20   0.45     -0.22   0.06        0.42
  8b    -0.83   0.43     -0.20   0.03       -0.63
  9b     0.01   0.20     -0.09   0.05        0.10
 10b    -0.92   0.21     -0.96   0.04        0.04
 11b     0.24   0.37      0.84   0.06       -0.60
 12b     0.17   0.42     -0.49   0.07        0.66
 13b     0.65   0.32      0.31   0.05        0.34
 14b     0.29   0.39      0.77   0.05       -0.48
 15b     0.14   0.41     -0.26   0.06        0.40
 16c     0.75   0.34      0.86   0.04       -0.11
 17c     0.15   0.50     -0.20   0.05        0.35
 18c    -0.09   0.52      0.25   0.07       -0.34
 19c    -0.89   0.55     -0.27   0.04       -0.62
 20c     0.75   0.38      0.47   0.04        0.28
 21c     0.12   0.48      0.75   0.05       -0.63

N Items:
Test a = 7
Test b = 8
Test c = 6

Range of persons used to calibrate CAT items = 17-31

N of persons used to calibrate written items:
Test a = 1052
Test b = 731
Test c = 641

Linear Transformation of Written Scores to CAT scale:
Test a: y = .01 + .763x   [X = 0.13  SD = 1.56 *]
Test b: y = .25 + .564x   [X = 0.48  SD = 0.81 *]
Test c: y = .02 + .827x   [X = 1.05  SD = 1.17 *]

† See two items outside the 95% confidence band (Figure 5)

* Since written measures were transformed onto the CAT scale, means
and standard deviation units for each group are identical.

Figure 5: Figure Calibrations, In Print versus On-Screen

[Plot not reproduced.]

Figure 6: In Print vs On-Screen Figure Calibration, Standardized Residual Comparison

[Plot not reproduced.]